Abstract: We present algorithmic improvements and a hardware
architecture for list SC decoding of polar codes. More
specifically, we show how to completely avoid copying of the
log-likelihoods, which is algorithmically the most cumbersome part
of list SC decoding. The hardware architecture was synthesized
using a UMC 90 nm VLSI technology, resulting in a decoder
which can achieve a coded throughput of up to 103 Mbps.

Index Terms: Polar codes, list SC decoding, VLSI.

I. Introduction 

Channel polarization gives rise to an elegant and provably
good class of channel codes, called polar codes [1].
The main idea behind polar codes is that, by using a simple
polarizing transform on a channel $n$ times, we can construct
$N = 2^n$ channels that are either better (in terms of mutual
information) or worse than the original channel. It was shown
by Arıkan in [1] that, as $n$ is increased, almost all channels
become either perfect (i.e., noiseless) or completely useless. 
Furthermore, the fraction of noiseless channels goes to the 
mutual information between the input and the output of the 
original channel. This gives rise to the following uniform 
capacity achieving coding scheme for binary input channels: 
use the noiseless channels to transmit information and fix the 
input of the useless channels to some value which is known 
to both transmitter and receiver. 

Decoding of polar codes is usually performed using
a successive cancellation (SC) decoder with complexity
$O(N \log N)$. Hardware architectures for SC decoding of polar
codes were discussed in [2] and [3], while the first ASIC
of such a decoder was presented in [4]. Recently, more
sophisticated decoding algorithms, such as the list SC decoder
[5], [6] and the stack SC decoder [7], were introduced. These
algorithms provide improved error correcting performance at
the cost of increased complexity. Unfortunately, the list SC
decoder is heavily burdened by a likelihood copying step and
no architecture of such a decoder exists yet in the literature.

Contribution and Outline: This letter presents an architecture
for list SC decoding of polar codes. To this end, we also
describe how the copying of the intermediate likelihoods in
the list SC decoding algorithm can be avoided completely. In
Section II we briefly review the construction and decoding of
polar codes. Section III discusses algorithmic improvements,
while in Section IV the proposed list SC decoder architecture
is presented. Section V summarizes VLSI implementation
results and Section VI concludes this letter.

II. Polar Codes 
Following the notation of [1], we use $a_1^N$ to denote a
row vector $(a_1, \dots, a_N)$ and $a_i^j$ to denote the subvector
$(a_i, \dots, a_j)$. If $j < i$, then the subvector $a_i^j$ is empty. We use
log and ln for the binary and natural logarithm, respectively.

Fig. 1. SC decoding of bit 1 with N = 4.

Let $W$ denote a binary input discrete and memoryless
channel (B-DMC) with input $x \in \{0, 1\}$, output $y \in \mathcal{Y}$, and
transition probabilities $W(y|x)$. A polar code is constructed
by recursively applying a polarizing transform to the channel
$W$ $n$ times. This transform is linear and it can be expressed
as a $2 \times 2$ matrix, denoted by $F$. The $n$-fold application of
this transform can be expressed as an $N \times N$ matrix $G$,
with $G = F^{\otimes n}$, where $\otimes$ denotes the Kronecker product.
Encoding is performed by choosing an information sequence
$u_1^N \in \{0, 1\}^N$ and calculating the codeword $x_1^N = u_1^N G$. This
codeword is transmitted over $W$ and a noisy codeword $y_1^N$ is
received.
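As a concrete illustration of the encoding step, the following Python sketch builds the $n$-fold Kronecker power of $F$ and computes the codeword as a vector-matrix product over GF(2). The kernel entries used for `F` are Arıkan's usual choice and are an assumption here, since the text does not spell them out. The helper names `kron_power` and `encode` are hypothetical.

```python
# Sketch only: F is the usual 2 x 2 polar kernel (an assumption, the text
# above only says that the transform is a 2 x 2 matrix).
F = [[1, 0],
     [1, 1]]

def kron_power(F, n):
    """Return the n-fold Kronecker power of F as a list of rows (n >= 1)."""
    G = [row[:] for row in F]
    for _ in range(n - 1):
        # Kronecker product G (x) F: every entry g of G becomes the block g * F.
        G = [[g * f for g in g_row for f in f_row]
             for g_row in G for f_row in F]
    return G

def encode(u, G):
    """Compute the codeword x = u G with arithmetic modulo 2."""
    N = len(u)
    return [sum(u[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]

G4 = kron_power(F, 2)          # N = 4
x = encode([1, 0, 1, 1], G4)   # x == [1, 1, 0, 1]
```

The explicit matrix product costs $O(N^2)$ operations and is only meant to mirror the definition; it is not how an efficient encoder would be organized.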



A. SC Decoding 

The decoding method proposed by Arıkan is based on successive
cancellation. First, an estimate for $u_1$, denoted by $\hat{u}_1$, is
calculated based on $y_1^N$. Then, $u_2$ is decoded, based on $y_1^N$ and
the knowledge of $u_1$, etc. In principle, it is possible to calculate
the mutual information between $(y_1^N, u_1^{i-1})$ and $u_i$ for every
$i$. A polar code of rate $R$ is constructed by letting the $NR$
$u_i$'s with the highest mutual information convey information,
while freezing the remaining $u_i$'s to some known value. If $W$
is symmetric, then the frozen value can be 0 for all frozen
$u_i$'s and the resulting polar code is a linear block code. The
exact decoding procedure is dictated by the recursive structure
of the code. In Fig. 1 the decoding process is visualized
for N = 4. On the right side of the graph, the likelihoods
$W(y_i|x_i)$, $i \in \{1, \dots, N\}$, $x_i \in \{0, 1\}$, are available. These
likelihoods are combined pair-wise by going through a data
dependency graph with $N \log N$ nodes. The output on the left
side of the graph is $P(y_1^N, \hat{u}_1^{i-1} | u_i)$, $u_i = 0, 1$, $i \in \{1, \dots, N\}$.

Fig. 2. Decoding tree for N = 4.

Decisions for non-frozen bits are taken according to the
maximum likelihood (ML) rule

$$\hat{u}_i = \arg\max_{u_i \in \{0,1\}} P(y_1^N, \hat{u}_1^{i-1} | u_i). \tag{1}$$



For frozen bits, the frozen value is used as a decision. The
circles and squares in Fig. 1 represent the two ways in which
the two pairs of incoming likelihoods at each node, denoted by
$a_1^2$ and $b_1^2$, are combined in order to produce the intermediate
likelihoods, i.e., the functions $f : [0,1]^4 \to [0,1]^2$ and
$g : [0,1]^4 \times \{0,1\} \to [0,1]^2$ with [1]

$$f(a_1^2, b_1^2) = \left( \tfrac{1}{2}(a_1 b_1 + a_2 b_2),\ \tfrac{1}{2}(a_2 b_1 + a_1 b_2) \right), \tag{2}$$

$$g(a_1^2, b_1^2, u_s) = \left( \tfrac{1}{2} a_{1+u_s} b_1,\ \tfrac{1}{2} a_{2-u_s} b_2 \right), \tag{3}$$



where $u_s$ is called a partial sum. Each partial sum is a linear
combination of some of the previously decoded codeword
bits. If intermediate likelihoods are stored to avoid
recalculations, then each node is activated only once. Thus,
exactly $N \log N$ applications of (2) or (3) are needed and the
computational complexity of SC decoding is $O(N \log N)$.
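The two node updates can be sketched in Python for the smallest case, N = 2, where one application of each function suffices. The sketch assumes that the first update averages over the unknown second bit and that the second update selects the entry of $a$ indicated by the partial sum; the numeric likelihoods and the function name `sc_decode_n2` are made up for illustration.

```python
def f(a, b):
    """Update (2): likelihoods of the first input bit of a node, for values 0 and 1."""
    return (0.5 * (a[0] * b[0] + a[1] * b[1]),
            0.5 * (a[1] * b[0] + a[0] * b[1]))

def g(a, b, u_s):
    """Update (3): likelihoods of the second input bit, given the partial sum u_s."""
    return (0.5 * a[u_s] * b[0],
            0.5 * a[1 - u_s] * b[1])

def sc_decode_n2(a, b):
    """SC-decode two non-frozen bits from the channel likelihood pairs a and b."""
    p1 = f(a, b)
    u1 = 0 if p1[0] >= p1[1] else 1      # ML rule (1) for the first bit
    p2 = g(a, b, u1)
    u2 = 0 if p2[0] >= p2[1] else 1      # ML rule (1) for the second bit
    return u1, u2

a = (0.9, 0.1)   # (W(y_1|0), W(y_1|1)), made-up values
b = (0.8, 0.2)   # (W(y_2|0), W(y_2|1)), made-up values
decoded = sc_decode_n2(a, b)   # (0, 0)
```

A quick consistency check is that summing the two outputs of `g` for a given partial sum reproduces the corresponding output of `f`, because `f` marginalizes over the bit that `g` resolves.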

B. List SC Decoding 

Successive decoding can be described as a search procedure
on a full binary tree. The $2^{i-1}$ nodes at depth $(i-1)$
represent $u_i$ given all possible choices for $u_1^{i-1}$. The two
outgoing edges of each node are labeled with the two possible
choices for $u_i$. A decoder explores one or more paths in the
tree by deciding which edge to follow at each step based on
some metric. The SC decoder, for example, explores only a
single path from the root to the leaves of the tree. It uses
the likelihood in (1) as a metric for non-frozen bits and it
always follows the edge corresponding to the frozen value for
frozen bits. An example is illustrated by the dashed red line
in Fig. 2. The SC decoder has the drawback that an erroneous
decision can never be corrected at a later step.
The list SC decoder, on the other hand, performs a breadth-first
search (BFS) on the tree under a complexity constraint.
This constraint is enforced by discarding some of the paths
at each step. More specifically, the list SC decoder with list
size $L$ keeps track of $L$ paths simultaneously. The list SC
decoder also uses the likelihood in (1) as a metric for non-frozen
bits. More formally, let $(u_1^{i-1}(1), \dots, u_1^{i-1}(L))$ denote
the $L$ distinct decoding paths after the $(i-1)$-th bit has been
decoded. For every path $\ell \in \{1, \dots, L\}$, there are two choices
for $u_i$. Out of the resulting $2L$ paths, the $L$ paths with the
highest metric are preserved. When bit $N$ is reached, the path
with the highest metric is declared as the decoded codeword. A
possible decoding trajectory for a list SC decoder with $L = 2$
is shown by the green dotted line in Fig. 2.

Fig. 3. Frame error rate (FER) performance of a polar code of length
N = 1024 bits under SC and list SC decoding with various list sizes.
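The constrained breadth-first search can be sketched as follows. This is a minimal illustration of the extend-and-prune step only: the `metric` argument is a hypothetical stand-in for the likelihood-based path metric, `frozen` is an assumed mapping from bit index to frozen value, and the toy metric in the usage lines is made up so that the example is self-contained.

```python
import heapq

def list_search(N, L, metric, frozen=None):
    """Breadth-first tree search that keeps at most L paths per level.

    metric(path) must return a number, larger meaning more likely.
    frozen maps a 0-based bit index to its known value.
    """
    frozen = frozen or {}
    paths = [()]                                   # start with one empty path
    for i in range(N):
        if i in frozen:
            # Frozen bit: every path is extended with the known value.
            paths = [p + (frozen[i],) for p in paths]
        else:
            # Non-frozen bit: each path has two children, keep the L best.
            candidates = [p + (u,) for p in paths for u in (0, 1)]
            paths = heapq.nlargest(L, candidates, key=metric)
    return max(paths, key=metric), paths

# Toy metric (made up): negative Hamming distance to a target sequence.
target = (1, 0, 1, 1)
toy_metric = lambda p: -sum(bit != t for bit, t in zip(p, target))
best, survivors = list_search(4, 2, toy_metric, frozen={1: 0})
```

With `L = 1` the same routine degenerates into the single-path SC search described above.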

The performance of list SC decoding for N = 1024 and for
some realistic list sizes is compared with the performance of
SC decoding in Fig. 3. We observe that the returns of increased
list size are small for L > 4 and that at high SNR, using L > 2
provides almost no gain. However, at a FER of $10^{-2}$, which is
the target FER for many communications standards, the gain
of list SC decoding is more prominent.

III. Algorithmic Considerations 

The list SC decoder can be thought of as a combination of
components. There are $L$ metric computation units (MCUs)
which calculate the metrics for each path using the sequential
SC procedure. The MCUs have a memory which stores the
intermediate likelihoods, the partial sums that correspond to
each path, and the path itself. We call these three memories
collectively the state of the MCU. Moreover, there is a component
that manages the tree search by performing path selection
based on the metrics that are calculated by the MCUs.

A. Improved State Copying 

After the path selection step, each of the initial $L$ paths is
either discarded, kept, or duplicated, depending on whether it
has zero, one, or two children nodes in the set of $L$ largest
metrics, respectively. In order to duplicate a path, the state
of its associated MCU is copied to another MCU, with slight
differences between the two copies that correspond to the two
different values of $u_i$. It was shown in [5] that list SC decoding
can be performed with complexity $O(LN \log N)$ when using
a lazy copy technique for state copying. However, there is
no need to copy the intermediate likelihoods at all. If the
intermediate likelihoods are stored in $L$ memories, then each
MCU only needs to know which memory to read from at each
stage of SC decoding. Note that, since the channel likelihoods
are never overwritten, only one copy is needed and all MCUs
can read from this one copy. Moreover, the values calculated
at stage 0 are needed only once, so they are not stored.

Fig. 4. Overview of the proposed list SC decoder architecture.
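The discard, keep, or duplicate decision can be illustrated with a short Python sketch. It assumes that each of the current paths reports two metrics, one per value of the next bit, and that larger metrics are better; the function name `classify_paths` and the numeric values are hypothetical.

```python
def classify_paths(metrics, L):
    """Label each parent path by how many of its children survive.

    metrics[l][u] is the metric of path l extended with bit u.
    Returns one label per parent: 'discarded', 'kept' or 'duplicated'.
    """
    children = [(metrics[l][u], l, u)
                for l in range(len(metrics)) for u in (0, 1)]
    survivors = sorted(children, reverse=True)[:L]   # the L largest metrics
    count = [0] * len(metrics)
    for _, parent, _ in survivors:
        count[parent] += 1
    return [("discarded", "kept", "duplicated")[c] for c in count]

# Both children of path 0 beat both children of path 1.
fate_a = classify_paths([[0.40, 0.30], [0.05, 0.01]], L=2)
# One child of each path survives.
fate_b = classify_paths([[0.40, 0.10], [0.30, 0.05]], L=2)
```

Only the "duplicated" case triggers a state copy, which is the operation that the pointer-based scheme described here makes cheap.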

For example, consider the code in Fig. 1. The path selection
unit only needs to track where each MCU has to read from
at stage 1. Assume that $L = 2$, that decoding has just started,
that $u_1, u_3$ are non-frozen and that $u_2, u_4$ are frozen. List SC
decoding starts with one (empty) path whose associated MCU
reads from the first memory. When the MCU has finished
calculating the metric, this path will be split into 2 paths,
one for each choice of $u_1$. Now the MCUs for both resulting
paths have to read from the first memory at all stages of SC
decoding, but the first and second MCUs write their output
back to the first and second state memory, respectively. So,
the correct likelihoods for the second MCU for the stages
that are being processed from this point on are in the second
memory. Since $u_2$ is frozen, we simply extend both paths with
the frozen bit value and continue to $u_3$. To decode $u_3$, SC
decoding goes back to stage 2 and uses the channel likelihoods
to compute the likelihoods at stage 1. After stage 2 has been
processed, the correct likelihoods for the second MCU at stage
1 are stored in the second memory. The second MCU is
informed that at stage 1 it has to read from the second memory
by means of an auxiliary memory of dimension $L \times (\log N - 2)$,
which is called the pointer memory. The index of the memory
that each MCU has to read from at each stage is stored in
this memory. With this modification, instead of copying the
intermediate likelihoods, it suffices to copy between the rows
of this very small auxiliary memory.
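The pointer idea can be mimicked in software with a small Python class. This is a sketch under simplifying assumptions: every MCU always writes to its own memory, a write redirects that MCU's pointer for the written stage to its own memory, and all MCUs process the same stage at the same time. The class name `PointerLLMemory` and its methods are hypothetical and do not describe the hardware interface.

```python
class PointerLLMemory:
    """L likelihood memories plus a small table of read pointers."""

    def __init__(self, L, num_stages):
        # mem[m][s] holds the values that MCU m last wrote at stage s.
        self.mem = [[None] * num_stages for _ in range(L)]
        # ptr[l][s] is the index of the memory MCU l reads from at stage s.
        # Initially everything points to the first memory.
        self.ptr = [[0] * num_stages for _ in range(L)]

    def write(self, l, s, values):
        """MCU l stores new stage-s values in its own memory."""
        self.mem[l][s] = values
        self.ptr[l][s] = l

    def read(self, l, s):
        """MCU l fetches stage-s values from wherever its pointer says."""
        return self.mem[self.ptr[l][s]][s]

    def duplicate(self, src, dst):
        """Clone a path: only one row of pointers is copied, no likelihoods."""
        self.ptr[dst] = list(self.ptr[src])

m = PointerLLMemory(L=2, num_stages=3)
m.write(0, 1, [0.1, 0.2])      # first MCU computes stage 1
m.duplicate(0, 1)              # path split: second MCU inherits the pointers
shared = m.read(1, 1)          # still served from the first memory
m.write(1, 1, [0.3, 0.4])      # second MCU recomputes stage 1 for its own path
```

After the last line the second MCU reads stage 1 from its own memory, while the other stages are still served from the first memory.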

B. Likelihood Representation 

We represent the likelihoods in their negative log-likelihood
(LL) form. This simplifies the computations in (2) and (3), provides
numerical stability, and eases the binary representation
because the LLs are non-negative numbers. More specifically,
we use $-\ln W(y_i|x_i)$, $x_i = 0, 1$, $i = 1, \dots, N$. Using this
representation, (2) and (3) become

$$f(a_1^2, b_1^2) = \left( \min{}^*(a_1 + b_1, a_2 + b_2),\ \min{}^*(a_2 + b_1, a_1 + b_2) \right), \tag{4}$$

$$g(a_1^2, b_1^2, u_s) = \left( a_{1+u_s} + b_1,\ a_{2-u_s} + b_2 \right), \tag{5}$$

where $\min^*(a, b) = \min(a, b) - \ln(1 + e^{-|a-b|})$, which equals
$-\ln(e^{-a} + e^{-b})$, the negative logarithm of a sum of two likelihoods. The $f$
function is simplified by using an approximation that ignores
the transcendental $\ln(\cdot)$ term. In Fig. 5 the performance of the
list SC decoder when using this approximation is plotted using
a dashed line. It can be seen that there is practically no loss
in performance with respect to the exact implementation.

Fig. 5. FER performance of a polar code of length N = 1024 bits under
floating-point and fixed-point list SC decoding for L = 2.
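The LL-domain updates and the effect of dropping the logarithmic term can be tried out with the Python sketch below. It takes the exact combination of two negative log-likelihoods to be $-\ln(e^{-a} + e^{-b})$ and, as an assumption, drops the constant $\ln 2$ that the factor 1/2 in the probability-domain updates would contribute, since that constant is common to both entries of a pair. The function names and the numeric likelihoods are made up.

```python
import math

def min_star(a, b):
    """Exact combination of two negative log-likelihoods: -ln(e^-a + e^-b)."""
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))

def f_ll(a, b, exact=False):
    """LL-domain f update; with exact=False the ln(.) correction is ignored."""
    m = min_star if exact else min
    return (m(a[0] + b[0], a[1] + b[1]),
            m(a[1] + b[0], a[0] + b[1]))

def g_ll(a, b, u_s):
    """LL-domain g update for a given partial sum u_s (0 or 1)."""
    return (a[u_s] + b[0], a[1 - u_s] + b[1])

# Made-up channel likelihoods, converted to negative log-likelihoods.
a = tuple(-math.log(p) for p in (0.9, 0.1))
b = tuple(-math.log(p) for p in (0.8, 0.2))
exact = f_ll(a, b, exact=True)
approx = f_ll(a, b)
```

The approximate value never falls below the exact one, because the ignored correction term is non-negative.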

IV. List SC Decoder Architecture 

An overview of the proposed VLSI architecture is presented
in Fig. 4. The proposed VLSI architecture contains two main
components, i.e., the MCUs and the path selection unit. As
already described, the MCUs are responsible for calculating
the path metrics, while the path selection unit manages the tree
search. Each of the $L$ MCUs is composed of the LL, partial
sum and path memories, and of an SC decoder core which
performs the metric update based on the state that it is given
by the path selection unit. An additional unit is responsible
for redirecting the correct LLs to each decoder core. The
path selection unit contains a sorter which finds the $L$ best
metrics out of $2L$ options, along with the path index and the
value of $u_i(\ell)$ from which they resulted. It also contains the
pointer memory, which manages the memory read access of
the decoder cores. A more detailed description of each unit
follows.

A. LL Quantization 

Since the LLs are positive numbers, their dynamic range
increases as SC decoding moves towards stage 0. When an LL
pair overflows, it is useless for making a decision, since both
LLs will have the same value. Thus, when using LLs, it is
crucial to avoid overflows. In (4) and (5), two numbers with
the same dynamic range are added, so that all overflows can be
avoided by increasing the number of bits used to store the LLs
by one bit per stage. This way, the only performance degradation
with respect to the floating point implementation comes
from the quantization of the channel LLs. Let $Q_{ch}$ denote
the number of bits used for the quantization of the channel
LLs. The performance of the list SC decoder under various
quantization bit-widths is presented in Fig. 5. We observe that,
if we set $Q_{ch} = 3$, then the degradation with respect to the
floating point and to the $Q_{ch} = 4$ implementations is very
small.
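The overflow argument can be made concrete with a few lines of Python. The sketch models LLs as unsigned integers and an overflow as clipping to the largest representable value; this saturation model, the helper names, and the sample values are assumptions made for illustration.

```python
def saturating_add(x, y, bits):
    """Add two non-negative integers and clip the result to `bits` bits."""
    return min(x + y, (1 << bits) - 1)

def g_pair(a, b, u_s, bits):
    """LL-domain g update on quantized values with a fixed output width."""
    return (saturating_add(a[u_s], b[0], bits),
            saturating_add(a[1 - u_s], b[1], bits))

q_ch = 3                                   # channel LL width
a, b = (6, 7), (5, 7)                      # made-up 3-bit LL pairs
same_width = g_pair(a, b, 0, q_ch)         # (7, 7): both entries clip
one_more_bit = g_pair(a, b, 0, q_ch + 1)   # (11, 14): still distinguishable
```

With the output width kept at 3 bits the pair collapses to two equal values and carries no decision information, whereas one extra bit preserves the ordering.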

B. Metric Computation Units 

1) SC Decoder Cores: The architecture of the SC decoder
cores is based on the log-likelihood ratio (LLR) based architecture
of [3] with a modification to implement LL based
SC decoding, because the path metrics cannot be extracted
from an LLR based decoder. Each decoder core consists
of $P$ processing elements (PEs) that operate on up to $P$
nodes of each stage in parallel. The maximum bit-width,
denoted by $B_{max}$, determines the width of the PEs. We have
$B_{max} = Q_{ch} + \log N$. The PEs implement both (4) and (5).
An additional input is used to choose between the $f$ and $g$
outputs, depending on which function is needed at each stage.

2) Memories: SC decoding can be implemented by using
$2N$ memory positions [4]. The first $N$ positions, which correspond
to the channel LLs, are never overwritten during SC
decoding. Thus, only one copy of the channel LL memory is
needed, from which all decoder cores can read. The remaining
$N$ memory positions for each decoder core have to be distinct.
So, the number of required memory positions is $(L + 1)N$ and
the total number of bits, denoted by $B_{tot}$, is

$$B_{tot} = (2L + 2) N Q_{ch} + 4LN - 2L(\log N + Q_{ch} + 2). \tag{6}$$

There are $L$ partial sum and path memories, with $N$ memory
positions of 1 bit each. In order to complete the state copying
step in a single cycle, all the contents of each of the $L$ partial
sum memories can be copied to and from one another by
means of crossbars. The same holds for the path memories.
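The closed-form LL memory size can be cross-checked numerically. The Python sketch below assumes one particular layout that reproduces the expression: a shared channel memory with N positions of two LLs at the channel bit-width, and, per decoder core, stages s = 0, ..., log N - 1 holding 2^s positions of two LLs whose width grows by one bit per stage away from the channel. This layout and the function names are assumptions for illustration, not a statement about the actual memory organization.

```python
def total_ll_bits_by_stage(N, L, q_ch):
    """Sum the LL storage stage by stage under the assumed layout."""
    n = N.bit_length() - 1                 # log2(N) for a power of two
    channel = 2 * N * q_ch                 # one shared copy of the channel LLs
    per_core = sum(2 * (1 << s) * (q_ch + n - s) for s in range(n))
    return channel + L * per_core

def total_ll_bits_closed_form(N, L, q_ch):
    """Closed-form bit count with the same structure as the expression in the text."""
    n = N.bit_length() - 1
    return (2 * L + 2) * N * q_ch + 4 * L * N - 2 * L * (n + q_ch + 2)

bits_l2 = total_ll_bits_closed_form(1024, 2, 4)   # 32704 bits
```

Under this layout the two functions agree for every power-of-two N, which gives some confidence that the closed form corresponds to a one-bit-per-stage growth of the LL width.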

C. Path Selection Component 

TABLE I
Synthesis results for N = 1024.

                    L = 2        L = 4        SC decoder
Norm. Cell Area     1.60 mm²     3.53 mm²
Clock Frequency     266 MHz      230 MHz      150 MHz
Coded Throughput    103 Mbps     89 Mbps      98 Mbps
Technology          UMC 90 nm    UMC 90 nm    UMC 180 nm

1 Size is reported for a fabricated chip.

1) Metric Comparator: For the path selection step, the $2L$
metrics are sorted in a single cycle. To minimize the delay, a
radix-$2L$ sorter was used. This sorter requires $2L(2L - 1)/2$
comparators of $B_{max}$-bit quantities. Since a single sorter is
needed, minimizing its size is not critical. In fact, the metric
sorter occupies only 0.1% and 0.8% of the total decoder area
for $L = 2$ and $L = 4$, respectively. A register was added
between the output of the decoder cores and the metric sorter
in order to reduce the length of the critical path. Unfortunately,
because decoding cannot proceed before the choice of paths
is made, we effectively introduce an idle cycle every time the
output of the metric sorter is needed. This happens $RN$ times
per codeword. Thus, by modifying the expression found in [4],
the number of cycles required to decode one codeword is

$$C_{BA} = (2 + R)N + \frac{N}{P} \log \frac{N}{4P}. \tag{7}$$

If we ignore the second term, which is small, then the overhead
with respect to the case where we do not add a register is
$RN$ cycles. Nevertheless, adding the register leads to a higher
throughput due to a significantly higher clock frequency.
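The relation between cycle count, clock frequency, and coded throughput can be explored with the Python sketch below. It assumes that the cycle count has the form (2 + R)N + (N/P) log2(N/(4P)), where P is the number of processing elements per decoder core, and that one codeword of N bits is delivered every that many cycles. The values R = 0.5 and P = 32 are assumptions chosen for illustration; neither is stated in the text.

```python
from math import log2

def decoding_cycles(N, R, P):
    """Cycles per codeword: (2 + R) N plus the semi-parallel term (N/P) log2(N/(4P))."""
    return (2 + R) * N + (N / P) * log2(N / (4 * P))

def coded_throughput_mbps(N, R, P, f_clk_mhz):
    """Coded bits per second, in Mbps, for a clock frequency given in MHz."""
    return N * f_clk_mhz / decoding_cycles(N, R, P)

# Assumed parameters (not given in the text): rate 1/2 and 32 PEs per core.
cycles = decoding_cycles(1024, 0.5, 32)                  # 2656 cycles
throughput = coded_throughput_mbps(1024, 0.5, 32, 266)   # about 102.6 Mbps
```

Under these assumed values the first term contributes 2560 cycles and the second only 96, which illustrates why the second term can be ignored when estimating the overhead of the extra register.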

2) Pointer Memory: The pointer memory contains
$L \times (\log N - 2)$ elements. Each element can take on $L$ distinct
values, so we need $\log L$ bits for the representation. In total,
the pointer memory contains $L \log L (\log N - 2)$ bits. For
$L = 2, 4$ and $N = 1024$, this translates to 16 and 64 bits,
respectively. This memory also has the copying functionality
that the partial sum and path memories provide.
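The pointer memory size quoted above is easy to reproduce. The short Python sketch below assumes that L and N are powers of two; the function name is hypothetical.

```python
def pointer_memory_bits(L, N):
    """Bits in the pointer memory: L rows, (log2 N - 2) entries, log2 L bits each."""
    log_L = L.bit_length() - 1   # exact log2 for powers of two
    log_N = N.bit_length() - 1
    return L * log_L * (log_N - 2)

bits_l2 = pointer_memory_bits(2, 1024)   # 16
bits_l4 = pointer_memory_bits(4, 1024)   # 64
```

Compared with the tens of thousands of bits needed for the LL memories, copying a row of this table is a negligible operation.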

V. VLSI Synthesis Results 

Synthesis results for N = 1024 and L = 2, 4, using a
UMC 90 nm CMOS technology, are shown in Table I. There
exists no other list SC decoder in the literature, so we can
only provide a comparison with the existing SC decoder.

VI. Conclusion 

In this work, the first list SC decoder architecture in the
literature was presented. It was also shown how to avoid
copying of the intermediate likelihoods by copying between
pointers instead of the actual values. The proposed architecture
scales well and has a throughput overhead with respect to an
SC decoder of slightly less than $50R$ percent.